Regression Modelling part 1
Now that we are comfortable with visualising and manipulating data in R, we can proceed to modelling data.
There are many different modelling techniques. However, we will begin with one of the most commonly used and easiest to understand: linear regression.
ILOs for today:
We shall now load into R all of the libraries we will need for this session. This can be done by typing the following into your R script:
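A sketch of that loading step is given below (the package list follows from the descriptions that accompany it):

```r
# Load the packages used in this session
library(ggplot2)      # visualisation
library(tidyverse)    # data wrangling
library(performance)  # model-fit checks for linear regression models
library(skimr)        # quick summary statistics
library(sjPlot)       # model summary tables and diagnostic plots
```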
- ggplot2 and tidyverse are used for data wrangling and visualisation.
- performance can be used to assess the model fit of linear regression models.
- skimr will be used to produce summary statistics for our data.
- sjPlot will be used to summarise the output of linear models, as well as to produce some diagnostic plots.

Teaching evaluations at UT Austin
Professor evaluation scores obtained from end-of-semester student evaluations for a sample of 463 courses taught by 94 professors at the University of Texas at Austin.
We will examine the instructors' evaluation scores based purely on one numerical explanatory variable: their beauty score.
The goal: Explore the relationship between the professors' average evaluation score and their average beauty rating.
Gapminder data
The Gapminder dataset is a comprehensive collection of global statistics on health, wealth, and development over time, spanning countries and decades.
We will examine life expectancy data from 2007 of every country on each continent.
The goal: Explore whether there are differences in the mean life expectancy across continents.
Credit Card Balance Data
A simulated data set containing information on credit card balances, demographics, and financial attributes for a sample of individuals.
We will examine the impact that individuals' credit limit and income have on their credit card balance.
The goal: Describe the relationship between credit balance and these two explanatory variables.
Quick univariate summaries via the skimr package.
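The table below can be produced with skimr's `skim()` function (a sketch; it assumes the evaluation data are stored in a data frame called `evals.scores` with columns `score` and `bty_avg`):

```r
library(skimr)

# Quick univariate summaries of every column in the data frame
skim(evals.scores)
```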
| Name | evals.scores |
|---|---|
| Number of rows | 463 |
| Number of columns | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
| bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
When our response and explanatory variable(s) are numerical, we can compute the correlation coefficient:
\[ \rho (x,y) = \dfrac{\sum_{i=1}^n (x_i -\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}} \]
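In R, this correlation can be computed directly with `cor()` (a sketch, again assuming the `evals.scores` data frame):

```r
# Pearson correlation between evaluation score and beauty rating
cor(evals.scores$score, evals.scores$bty_avg)
```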
When our response is numerical and our explanatory variable(s) are categorical, we can:
Summarise the variable of interest based on our categorical variable.

| continent | median | mean |
|---|---|---|
| Asia | 72.3960 | 70.72848 |
| Europe | 78.6085 | 77.64860 |
| Africa | 52.9265 | 54.80604 |
| Americas | 72.8990 | 73.60812 |
| Oceania | 80.7195 | 80.71950 |
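The table above can be reproduced with dplyr (a sketch; it assumes the 2007 Gapminder subset is stored in a data frame called `gapminder2007`):

```r
library(dplyr)

# Median and mean life expectancy by continent
gapminder2007 %>%
  group_by(continent) %>%
  summarise(median = median(lifeExp),
            mean   = mean(lifeExp))
```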
When our response is numerical and our explanatory variable(s) are categorical, we can:

- Summarise the variable of interest based on our categorical variable.
- Produce boxplots to examine the distribution of the numerical outcome variable across the different levels of the categorical variable.
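Such boxplots can be sketched with ggplot2 (again assuming a `gapminder2007` data frame):

```r
library(ggplot2)

# Distribution of life expectancy within each continent
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy (years)")
```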
The general form of the simple linear model is:
\[y_i = \beta_0 + \beta_1 x_{i,1} + \epsilon_i, \quad i = 1, \ldots, n, \quad \text{such that} \ \epsilon_i \sim \mathrm{Normal}(0,\sigma^2)\]
This is used to describe the relationship between a response \(y\) and a set of explanatory variables or predictors \(x\).
Modelling the relationship between the professor’s evaluation score and their beauty rating
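This model can be fitted with `lm()` and summarised with sjPlot's `tab_model()`, which produces a table like the one below (a sketch; `evals.scores` and `evals_model` are assumed names):

```r
library(sjPlot)

# Fit a simple linear regression of score on beauty rating
evals_model <- lm(score ~ bty_avg, data = evals.scores)

# Summarise the fitted model in a table
tab_model(evals_model)
```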
Response: score

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | 3.88 | <0.001 |
| bty_avg | 0.07 | <0.001 |
| Observations | 463 | |
| R² / R² adjusted | 0.035 / 0.033 | |
\[\widehat{\text{score}} = \widehat{\alpha} + \widehat{\beta} x_i = 3.88 + 0.07 \cdot \mathrm{bty\_avg},\]
- The intercept tells us that a professor with bty_avg = 0 would, on average, have a teaching score of 3.88 (note that a value of 0 lies outside the observed range of bty_avg).
- For every one-unit increase in bty_avg, there is an associated increase of, on average, 0.06664 units in score.

To assess the model fit we can plot the residuals against the fitted values. Other assumptions are justified on the basis of the experimental context and are not formally examined.
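A residuals-versus-fitted plot can be produced with base R (a sketch; `evals_model` is an assumed name for the fitted lm object):

```r
# Base R diagnostic: residuals against fitted values
plot(evals_model, which = 1)
```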
Here, we will fit a simple linear regression model where the explanatory variable is categorical, i.e. a factor with two or more levels.
\[{\text{life exp}} = {\alpha} + {\beta}_{\text{Amer}} \cdot \mathbb{I}_{\text{Amer}}(x) + {\beta}_{\text{Asia}} \cdot \mathbb{I}_{\text{Asia}}(x) + {\beta}_{\text{Euro}} \cdot \mathbb{I}_{\text{Euro}}(x) + {\beta}_{\text{Ocean}} \cdot \mathbb{I}_{\text{Ocean}}(x),\]
the intercept \({\alpha}\) is the mean life expectancy for our baseline category Africa;
\({\beta}_{\text{continent}}\) is the difference in the mean life expectancy of a given continent relative to the baseline category, i.e., Africa.
\(\mathbb{I}_{\text{continent}}(x)\) is an indicator function such that
\[\mathbb{I}_{\text{continent}}(x)=\left\{ \begin{array}{ll} 1 ~~~ \text{if country} ~ x ~ \text{is in the continent},\\ 0 ~~~ \text{Otherwise}.\\ \end{array} \right.\]
Rows: 142
Columns: 3
$ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", …
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, Asi…
$ lifeExp <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.6…
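This model can be fitted in the same way as before (a sketch, assuming the 2007 subset above is stored as `gapminder2007`):

```r
library(sjPlot)

# Life expectancy modelled on continent (Africa is the baseline level)
gap_model <- lm(lifeExp ~ continent, data = gapminder2007)
tab_model(gap_model)
```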
Response: lifeExp

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | 54.81 | <0.001 |
| continent [Americas] | 18.80 | <0.001 |
| continent [Asia] | 15.92 | <0.001 |
| continent [Europe] | 22.84 | <0.001 |
| continent [Oceania] | 25.91 | <0.001 |
| Observations | 142 | |
| R² / R² adjusted | 0.635 / 0.625 | |
Our regression equation is given as:
\[\widehat{\text{life exp}} = \widehat{\alpha} + \sum_{j=1}^{4} \widehat{\beta}_{j} \, \mathbb{I}_{\text{continent}_j}(x)\]
For example, the fitted mean life expectancy for a country in Asia is
\[\widehat{\alpha} + \widehat{\beta}_{\text{Asia}} \cdot \mathbb{I}_{\text{Asia}}(x) = 54.8 + 15.9 \cdot 1 = 70.7 \]
The general form of the (multiple) linear model is:
\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_p x_{i,p} + \epsilon_i, \quad i = 1, \ldots, n\]
\(y_i\) is the response for the \(i^{th}\) observation;
\(\beta_0\) is the intercept;
\(\beta_1\) is the coefficient for the first explanatory variable \(x_1\);
\(\beta_2\) is the coefficient for the second explanatory variable \(x_2\);
\(\beta_p\) is the coefficient for the \(p^{th}\) explanatory variable \(x_p\);
\(\epsilon_i\) is the \(i^{th}\) random error component.
The multiple regression model we will be fitting to the credit balance data is given as:
\[y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),\]
where \(x_{1i}\) is the credit limit and \(x_{2i}\) the income of the \(i^{th}\) individual.
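This two-predictor model can be fitted as follows (a sketch; it assumes the credit card data are stored in a data frame called `Credit` with columns `Balance`, `Limit` and `Income`):

```r
library(sjPlot)

# Multiple regression of balance on credit limit and income
credit_model <- lm(Balance ~ Limit + Income, data = Credit)
tab_model(credit_model)
```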
Response: Balance

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | -385.18 | <0.001 |
| Limit | 0.26 | <0.001 |
| Income | -7.66 | <0.001 |
| Observations | 400 | |
| R² / R² adjusted | 0.871 / 0.870 | |
Our regression equation is given as:

\[\widehat{\text{Balance}} = -385.18 + 0.26 \cdot \text{Limit} - 7.66 \cdot \text{Income}\]
The intercept represents the credit card balance (Balance) of an individual who has $0 for both credit limit (Limit) and income (Income).
The coefficient for credit limit (Limit) tells us that, taking all other variables in the model into account, each one-unit increase in credit limit is associated with an average increase in credit card balance of $0.26.
The coefficient for income (Income) tells us that, taking all other variables in the model into account, each one-unit increase in income is associated with an average decrease in credit card balance of $7.66.
Besides our usual diagnostic plots to check for homoscedasticity, linearity, normality, etc.,
we need to consider collinearity among our predictors.
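The pairwise correlations below can be computed with `cor()` (a sketch, again assuming the `Credit` data frame):

```r
# Pairwise correlations between the response and the two predictors
cor(Credit[, c("Balance", "Limit", "Income")])
```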
Balance Limit Income
Balance 1.0000000 0.8616973 0.4636565
Limit 0.8616973 1.0000000 0.7920883
Income 0.4636565 0.7920883 1.0000000
High collinearity among predictors may inflate the uncertainty of the parameter estimates.
To check for collinearity we can compute the VIF (Variance Inflation Factor):
VIF = 1: No collinearity (predictor is uncorrelated with others).
1 \(<\) VIF \(<\) 5: Moderate correlation (usually acceptable).
VIF \(\geq\) 5: High collinearity (may inflate coefficient variance, reducing reliability).
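VIFs can be obtained with the performance package loaded earlier (a sketch; `credit_model` is an assumed name for the fitted lm object):

```r
library(performance)

# Variance inflation factors for each predictor in the fitted model
check_collinearity(credit_model)
```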